How to visualise data in R

Published

Last updated: 20/12/2024 10:05

A compendium of code for visualising data in R (primarily using ggplot2).

Note: this guide uses built-in datasets from packages including ggplot2, gcookbook, palmerpenguins, and gapminder, amongst others.

1 Structure of a ggplot

  • Aesthetics describe the mapping between visual elements and variables in the data, e.g. the x-axis may be mapped to “time_point”, while colour may be mapped to “gender”.
  • Geoms are the type of visual ‘marks’ on a plot such as lines, points, or bars: they are geometrical objects used to represent data.
data %>% 
  ggplot(aes(x = ind_var, 
             y = dep_var)) +
  geom_point(aes(colour = factor(grouping_var)),
             size = 1.5)

To set a fixed attribute (e.g. one colour or size for every point or line), specify it outside aes(): it is a setting rather than a mapping, because it doesn’t relate to a variable in the data frame.
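
To make the distinction concrete, here is a minimal sketch using the mpg dataset from ggplot2. The object names p_mapped and p_set are just illustrative.

```r
library(ggplot2)

# Mapped: colour varies with a variable in the data, so it goes inside aes()
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class))

# Set: one fixed colour for every point, so it goes outside aes()
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue", size = 1.5)
```

Printing p_mapped draws one colour per class (with a legend), while p_set draws every point in the same colour and has no legend.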

2 Plot types

2.1 Points

mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(size = 1.5,
             position = "identity")

Points overlap here, which hides how many observations there are. Setting position to "jitter" adds a small random offset to each point so that overlapping points become visible.
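
A minimal sketch of the jittered version of the plot above. It uses position_jitter() rather than the "jitter" string only so that a seed can be set; the seed value is an arbitrary choice to make the offsets reproducible.

```r
library(ggplot2)

# position_jitter() adds a small random offset to each point;
# seed = 1 is arbitrary and only makes the result repeatable
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 1.5,
             position = position_jitter(seed = 1))
```

Printing p_jitter draws the plot. Use the width and height arguments of position_jitter() if the default amount of jitter is too much.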

Adding mappings for colour and shape:

mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class,
                 shape = class),
             size = 1.5,
             position = "jitter") 

Note that this can also be written as:

mpg %>%
  ggplot() +
  geom_point(aes(x = displ, y = hwy, 
                 colour = class,
                 shape = class), 
             size = 1.5,
             position = "jitter")

2.1.1 Point shapes

Note: the fill attribute can be changed to any colour for shapes 21 to 25. For these shapes, colour controls the outline and fill controls the interior.
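
As a sketch of how a fillable shape behaves, the example below uses shape 21 (a circle with an outline). The colours are arbitrary choices.

```r
library(ggplot2)

# shape 21 is a circle with a separate outline (colour) and interior (fill)
p_shape <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21,
             colour = "black",   # outline
             fill = "orange",    # interior
             size = 2,
             stroke = 0.5)       # outline thickness
```

Printing p_shape draws orange points with a thin black outline.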

2.2 Lines

gcookbook::countries %>%
  filter(Name %in% c("United Kingdom", "Ireland") & Year > 1980) %>%
  ggplot(aes(x=Year, y=GDP)) +
  geom_line(aes(colour = Name, 
                linetype = Name), 
            linewidth = 0.8)

# to add arrows to the ends of lines:
# arrow = arrow(length = unit(0.25, "cm"), ends = "last", type = "closed")

Adding a line of best fit

gcookbook::heightweight %>%
  ggplot(aes(x=heightIn, y=weightLb)) +
  geom_point(aes(colour=sex)) +
  geom_smooth(method = "lm",
              fullrange = T,
              aes(colour=sex))

Use geom_segment to add lines from points to fitted regression slope

ggplot(df, aes(x=x, y=y)) + 
  geom_point() + 
  theme_classic() + 
  geom_smooth(method="lm", se=F) + 
  geom_segment(aes(x = x, y = y,
                   xend = x, yend = Fitted), linetype = "dashed")
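
The recipe above assumes that df already has a column called Fitted. One way to create it is sketched below, using simulated data (the values and the seed are made up purely for illustration) and fitted() on an lm model.

```r
library(ggplot2)

# simulated data purely for illustration
set.seed(42)
df <- data.frame(x = 1:30)
df$y <- 2 + 0.5 * df$x + rnorm(30)

# fit the regression and store the fitted values alongside the data
fit <- lm(y ~ x, data = df)
df$Fitted <- as.numeric(fitted(fit))

p_resid <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # dashed vertical segment from each point to the regression line
  geom_segment(aes(xend = x, yend = Fitted), linetype = "dashed")
```

Each dashed segment is a residual, so its length is the difference between y and Fitted for that row.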

If you want to sum a value across factor levels using a line graph, you need to use stat_summary:

df %>%
  ggplot(aes(x=year, y=value)) +
  stat_summary(fun = "sum", geom = "line")

2.2.1 Linetypes

2.3 Bars

2.3.1 Simple bar chart

  • geom_col leaves the data as it is and merely represents values already in the dataframe
  • geom_bar uses stat_count to derive new values from the data. As a result, geom_bar doesn’t expect a y-value; if you do provide one, you must also set stat = "identity", which tells it to forgo the aggregation it would otherwise have done with stat_count.
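
A small sketch of the difference: both plots below show the same bars, but geom_bar counts the rows itself while geom_col is handed counts that were calculated beforehand. The object names are illustrative.

```r
library(ggplot2)

# geom_bar: counts the rows in each class for you
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col: expects the heights to be in the data already
class_counts <- as.data.frame(table(class = mpg$class))  # columns: class, Freq
p_col <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col()
```

The bar heights are identical in both plots; only the place where the counting happens differs.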

Using stat = "count"

palmerpenguins::penguins %>%
  ggplot(aes(x=species, fill=species)) +
  geom_bar(stat="count", 
           width = 0.8,
           show.legend = F)

Using stat = "identity"

gcookbook::drunk %>%
  pivot_longer(c(2:6)) %>%
  group_by(sex) %>%
  summarise(felonies = sum(value)) %>%
  ggplot(aes(x = sex, y = felonies)) +
  geom_bar(stat = "identity", width = 0.8)

You should also use stat = "identity" (with pre-calculated counts) if you want to reorder bars in descending order.

palmerpenguins::penguins %>%
  group_by(species) %>%
  summarise(n = n()) %>%
  ggplot(aes(reorder(species, -n), n)) +
  geom_bar(stat = "identity", width = 0.8, aes(fill = species),
           show.legend = F)

2.3.2 Dodged bar chart

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_dodge(),
           width = 0.8)

Note that the above graph contains NAs. To remove these, use tidyr::drop_na before plotting:

penguins %>%
  drop_na(sex) %>%
  ggplot(aes(x = island, fill = sex)) + ...

2.3.3 Stacked bar chart

This recipe counts the rows in each category. If you already have a y-value (e.g. counts or figures for each category of something), map it to y and use stat = "identity" or geom_col instead.

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_stack(), 
           width = 0.8)

2.4 Proportional stacked bar

When there is an explicit y-value:

prop_df
group type count
a     x    11091
a     y     4583
b     x     3974
b     y    10984

prop_df %>%
  ggplot(aes(x = group, y = count, fill = type)) +
  geom_bar(position = "fill", stat="identity", width = 0.8) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1,
                                                    scale = 100), 
                     breaks = c(0, 0.25, 0.5, 0.75, 1)) +
  ylab("Proportion")

When there is no explicit y-value (i.e. counts of a factor)

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_fill(),
           width = 0.8) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1,
                                                    scale = 100), 
                     breaks = c(0, 0.25, 0.5, 0.75, 1)) +
  ylab("Proportion")

Text is dealt with later on, but to add labels to the above chart you need to manually calculate proportions for each cell:

palmerpenguins::penguins %>%
  group_by(island, sex) %>%
  summarise(n = n()) %>% # you may need to use sum() here instead of n()
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = island, y = prop, fill = sex)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(aes(label =
                  paste0(sprintf("%.1f", prop * 100), "%")), 
            position = position_stack(vjust = 0.5),
            size = 3.5) +
  ylab("Proportion")

# note: to change decimal places, "%.0f", "%.2f" etc

2.5 Boxplot

palmerpenguins::penguins %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  stat_boxplot(
    geom ='errorbar', 
    width = 0.5) +
  geom_boxplot(
    notch = F,
    outlier.color = "red",
    outlier.size = 3
  )

2.6 Violin plot

palmerpenguins::penguins %>%
  ggplot(aes(x=species, y=bill_length_mm)) +
  geom_violin(aes(fill = species),
              draw_quantiles = c(0.5)) # 0.5 is median

2.7 Dotplot

For when there isn’t a huge amount of data. Each dot represents a single observation.

palmerpenguins::penguins %>%
  sample_n(125, replace=F) %>%
  ggplot(aes(x=body_mass_g)) +
  geom_dotplot(dotsize = 1, width = 1)

2.8 Histogram

For when there’s more data, or binning is better.

palmerpenguins::penguins %>%
  ggplot(aes(x=body_mass_g)) +
  geom_histogram(binwidth = 100, 
                 fill = "grey", 
                 colour = "black")

By default, geom_histogram will stack bars if there are multiple groups. You can change this by setting position to "identity", which overlays the groups (use alpha so that all of them remain visible):

ggplot2::diamonds %>%
  ggplot(aes(x = carat, fill = cut)) +
  geom_histogram(position = "identity", alpha = 0.4)

Example of a function to iterate through multiple histograms, by group, using purrr:

# firstly make a vector of all the variable names you want to plot
vars = c("bill_length_mm", "flipper_length_mm")

# then create a custom graphing function
hist_fun = function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], fill = .data[[y]]) ) +
    geom_histogram(alpha=0.5, position = "identity") +
    theme_bw() +
    ggtitle(x)
}

# use purrr::map to cycle through vars to produce plots, with a constant grouping factor
purrr::map(vars, ~ hist_fun(data = palmerpenguins::penguins, .x, "sex") )

2.9 Density

For large amounts of data. Use bw argument to change bandwidth (e.g. if you want more smoothing).

ggplot2::diamonds %>%
  ggplot(aes(x = carat)) +
  geom_density(fill = "lightblue", 
               alpha=0.8)
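
As a rough sketch of what the bw argument does, the two plots below use a narrow and a wide bandwidth. The values 0.02 and 0.2 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# narrow bandwidth: follows the data closely, so the curve is spiky
p_narrow <- ggplot(diamonds, aes(x = carat)) +
  geom_density(bw = 0.02, fill = "lightblue", alpha = 0.8)

# wide bandwidth: more smoothing, so peaks are flattened
p_wide <- ggplot(diamonds, aes(x = carat)) +
  geom_density(bw = 0.2, fill = "lightblue", alpha = 0.8)
```

If you would rather scale the default bandwidth than set it directly, the adjust argument does that (e.g. adjust = 2 doubles it).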

You may want to overlay distributions for levels of a factor.

ggplot2::diamonds %>%
  ggplot(aes(x = carat, fill = cut)) +
  geom_density(alpha = 0.4)

2.10 Pie

palmerpenguins::penguins %>%
  count(species) %>%
  ggplot(aes(x = "", y = n, fill = species)) +
  geom_col(color = "black") +
  coord_polar(theta = "y") +
  geom_text(aes(label = n),
            position = position_stack(vjust = 0.5)) +
  theme_void()

2.11 Sankey

library(networkD3)

links = data.frame(
  Var1 = c("Failed_Phonics", "Passed_Phonics", "Failed_Phonics", "Passed_Phonics"),
  Var2 = c("Failed_KS2", "Failed_KS2", "Passed_KS2", "Passed_KS2"),
  Freq = c(25114, 120356, 9019, 585893)
)

colnames(links) <- c("source", "target", "value")

nodes <- data.frame(name = as.factor(c("Failed_Phonics","Passed_Phonics","Failed_KS2","Passed_KS2")))

# Convert the source and target from characters to indices
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

# create Sankey diagram
sankeyNetwork(Links = links, 
              Nodes = nodes, 
              Source = "source",
              Target = "target", 
              Value = "value", 
              NodeID = "name", 
              fontSize = 12, 
              fontFamily = "sans-serif")

3 Some more advanced recipes

Add histogram to scatterplot using ggExtra::ggMarginal

set.seed(30)
df1 = data.frame(x = rnorm(500, 50, 10), y = runif(500, 0, 50))
p1 = ggplot(df1, aes(x, y)) + geom_point() + theme_bw()

ggExtra::ggMarginal(p1, type = "histogram",
                    margins = "both", # note: "x", "y", or "both" 
                    fill = "orange",
                    xparams = list(binwidth = 2))

Create scatterplot matrix using GGally::ggpairs.

# note: you can also add colouring by factor if you add: ggplot2::aes(colour = Species)
GGally::ggpairs(iris)

4 Plotting statistical summary data

It’s possible to plot different statistical summaries within ggplot2, for instance the median.

palmerpenguins::penguins %>%
  ggplot(aes(x=species, y=flipper_length_mm)) +
  geom_bar(fun = "median", 
           stat = "summary")

However, a more powerful method is to use stat_summary.

palmerpenguins::penguins %>%
  ggplot(aes(x=species, y=flipper_length_mm)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line", aes(group=1)) 

Plot mean and standard deviation using mean_sdl, which requires the Hmisc package (use the mult argument to change how many standard deviations are shown around the mean; the default is 2):

mpg %>%
  ggplot(aes(x = reorder(class, hwy), y = hwy)) +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(
                 mult = 1),
               geom = "errorbar",
               width = .4)

Add confidence intervals (requires the Hmisc package):

mpg %>%
  ggplot(aes(x = reorder(class, hwy), y = hwy)) +
  stat_summary(fun = "mean", 
               geom = "point") +
  stat_summary(fun.data = "mean_cl_normal",
               fun.args = list(
                 conf.int = .95),
               geom = "errorbar",
               width = .4) 

4.1 Pointrange

palmerpenguins::penguins %>%
  ggplot() +
  geom_pointrange(mapping = aes(x = species, y = flipper_length_mm),
                  stat = "summary",
                  fun.min = min,
                  fun.max = max,
                  fun = median)

5 Working with text

5.1 Plot title, subtitle, and caption

p + ggtitle("Main plot title", subtitle = "Plot subtitle")

# use labs to add a caption
p + labs(caption = "my caption")

# caption is right-justified by default. To change:
p + theme(plot.caption = element_text(hjust=0))

# centre align plot title
p + theme(plot.title = element_text(hjust = 0.5))

# change plot title size
p + theme(plot.title = element_text(size = 18))

5.2 Adding text to charts

palmerpenguins::penguins %>%
  group_by(species) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(
    aes(label = n),
    vjust = -0.5,
    size = 3.5)

6 Axes

Note that the same functions are used for x or y axes - adjust accordingly.

# custom axis titles
p + xlab("...")
p + ylab("...")

# remove axis title
p + theme(axis.title.x = element_blank())

# change axis title text size
p + theme(axis.title.x = element_text(size = 14))

# rotate x-axis title text 45 degrees
p + theme(axis.title.x = element_text(angle = 45, vjust = 0.5))

# rotate y-axis to 0 degrees (to be read horizontally not vertically)
p + theme(axis.title.y = element_text(angle = 0, vjust = 0.5))

# remove tick marks
p + theme(axis.ticks.x = element_blank())

6.1 Axis formatting options

# percent labels with breaks every 10% from 0% to 100%
# note: change scale to 1 if your values are already percentages (e.g. 25 rather than 0.25), otherwise leave at default of 100
p + scale_y_continuous(labels = scales::label_percent(accuracy = 1, scale = 100), breaks = (0:10)/10)

# currency prefix
p + scale_y_continuous(labels = scales::label_dollar(prefix = "£"))
# can also specify suffix, big.mark = ",", and decimal.mark = "."

# dates and times (these need a date or time scale rather than a continuous one)
p + scale_x_date(labels = scales::label_date(format = "%Y-%m-%d"))
p + scale_x_time(labels = scales::label_time(format = "%H:%M:%S"))

# thousands separator
p + scale_y_continuous(labels = scales::label_comma(big.mark = ","))
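
The label_*() functions return ordinary functions, so you can try them on a vector before adding them to a plot. A short sketch (the variable names are illustrative):

```r
library(scales)

# proportions: the default scale of 100 converts 0.25 to 25%
pct_from_prop <- label_percent(accuracy = 1)(c(0.25, 0.5))

# values that are already percentages: set scale = 1 so they are not multiplied
pct_from_pct <- label_percent(accuracy = 1, scale = 1)(c(25, 50))

# thousands separator
big_number <- label_comma()(1234567)

pct_from_prop  # "25%" "50%"
pct_from_pct   # "25%" "50%"
big_number     # "1,234,567"
```

This is a quick way to check the scale and accuracy settings without redrawing the whole chart.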

6.2 Limits and ranges of axes

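# xlim(), ylim(), or scale_x_continuous(limits = c(0, 100)) drop data outside the limits
# before any statistics are computed.
# coord_cartesian(xlim = c(0, 100)) only zooms in, so no data are removed.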
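
The difference between setting limits on a scale and zooming with coord_cartesian matters most when a statistic is being calculated. The sketch below plots the mean mpg per cylinder group from mtcars; the limits of 10 and 25 are arbitrary.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun = "mean", geom = "point")

# zoom only: the means are calculated from all of the data
p_zoom <- base + coord_cartesian(ylim = c(10, 25))

# scale limits: values above 25 are dropped before the means are calculated
p_drop <- base + ylim(10, 25)

zoom_means <- layer_data(p_zoom)$y
drop_means <- layer_data(p_drop)$y
```

Comparing zoom_means with drop_means shows that the 4-cylinder mean is lower in p_drop, because the cars with mpg above 25 were removed before averaging (ggplot2 warns about the removed rows).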

coord_cartesian and coord_fixed in action:

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() +
  geom_smooth(aes(colour = Species)) +
  coord_cartesian(xlim = c(4.5, 5.5))

# coord_fixed(ratio) sets the ratio of y to x: a value of 1 means one unit on the y-axis is
# as long as one unit on the x-axis. A value <1 compresses the y-axis.
# Also try e.g. coord_fixed(20/1), where one unit on the y-axis is drawn 20 times as long as
# one unit on the x-axis
  
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() +
  geom_smooth(aes(colour = Species)) +
  coord_fixed(0.5)

A tip: if you want a y-axis to begin at zero but don’t know the maximum, you can do this:

p + expand_limits(y = 0)

When you have a discrete x-axis, sometimes you get a lot of space between the plotted data and the edges of the plot. To control the spacing, add scale_*_discrete(expand = c(0.1, 0.1)), where the two values are the multiplicative and additive expansion respectively:

mpg %>%
    mutate(year = as.factor(year)) %>%
    group_by(year) %>%
    summarise(cty_m = mean(cty, na.rm = T)) %>%
    pivot_longer(c(cty_m)) %>%
    ggplot(aes(x = year, y = value)) +
    stat_summary(fun = "mean", geom = "line", aes(group = 1)) +
    scale_x_discrete(expand = c(0.1, 0.1))

Clipping determines whether to display elements that would lie outside the plot panel. Expanding sets a buffer margin around a plot to prevent overlapping.

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # zero expansion
  coord_cartesian(expand = FALSE,
                  clip = "off") +
  ggtitle("Clipping off")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # zero expansion
  coord_cartesian(expand = FALSE,
                  clip = "on") +
  ggtitle("Clipping on")

Use custom labels for x-axis: say you have numeric year labels in the data such as 202122 but you want to display these as “2021/22”.

# define a vector of labels, with one break per label
yr_breaks = c(201819, 201920, 202021, 202122, 202223)
yr_labels = c("2018/19", "2019/20", "2020/21", "2021/22", "2022/23")

p +
  scale_x_continuous(breaks = yr_breaks, labels = yr_labels)
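
An alternative sketch that avoids typing the labels by hand: the labels argument also accepts a function. The helper below, format_acad_year, is a hypothetical name and assumes six-digit values such as 202122.

```r
# hypothetical helper: turns 202122 into "2021/22"
# (assumes every value has exactly six digits)
format_acad_year <- function(x) {
  x <- as.character(x)
  paste0(substr(x, 1, 4), "/", substr(x, 5, 6))
}

format_acad_year(c(201819, 202122))  # "2018/19" "2021/22"
```

It can then be supplied as scale_x_continuous(breaks = ..., labels = format_acad_year), with the breaks set to the year values in the data.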

7 Themes

To add a pre-defined theme to a plot: p + theme_*().

Tips:

  • Add the complete theme (theme_*()) before any theme() adjustments, e.g. to axis attributes, because a complete theme added later will override them.
  • Set a global theme: theme_set(theme_classic(base_size = 16)).
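
A short sketch of why the order matters, using theme_classic() as the pre-defined theme. The object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# adjustment comes after the complete theme, so it is kept
p_kept <- base +
  theme_classic() +
  theme(axis.title.x = element_blank())

# adjustment comes before the complete theme, so it is overridden
p_lost <- base +
  theme(axis.title.x = element_blank()) +
  theme_classic()
```

Printing both shows that only p_kept has lost its x-axis title.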

7.1 Analysis Function theme (afcharts)

The Analysis Function (AF) has created its own package, afcharts, for producing accessible plots, based on ggplot2. The main function is theme_af(). It has the following arguments:

theme_af(
  base_size = 14,
  base_line_size = base_size/24,
  base_rect_size = base_size/24,
  grid = c("y", "x", "xy", "none"),
  axis = c("x", "y", "xy", "none"),
  ticks = c("xy", "x", "y", "none"),
  legend = c("right", "left", "top", "bottom", "none")
)

library(afcharts) # needed for theme_af() and scale_colour_discrete_af() below
afcharts::use_afcharts()

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_dodge(),
           width = 0.8) +
  theme_af(legend = "bottom")

gapminder::gapminder %>%
  filter(country %in% c("United Kingdom", "China", "Togo", "Bangladesh")) %>%
  ggplot(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(linewidth = 1) +
  theme_af(legend = "bottom") +
  scale_colour_discrete_af() +
  scale_y_continuous(limits = c(0, 82),
                     breaks = seq(0, 80, 20),
                     expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(1952, 2007, 5)) +
  labs(
    x = "Year",
    y = NULL)

You will need to reset the default ggplot2 theme if you don’t want to use afcharts any more.

ggplot2::theme_set(theme_grey())

7.2 A custom theme I like

# make this into a custom function that can be applied to any plot
theme_chris = function(){
  theme(
    axis.line = element_line(colour = "grey80"),
    panel.grid.major.y = element_line(colour = "grey90"),
    panel.grid.major.x = element_blank(), 
    panel.background = element_rect(fill = "white", colour = NA))
}

p + theme_chris()

8 Legends

You can control the palette, breaks, labels, and name. E.g. if the factor labels are too long, you can shorten them. Use a separate scale call to control other attributes like linetype or fill.

gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line(aes(colour = Name), linewidth = 0.8) +
    scale_colour_manual(values = c(RColorBrewer::brewer.pal(3, "Set2")),
                        breaks = c("United Kingdom", "United Arab Emirates", "United States"),
                        labels = c("UK", "UAE", "US"),
                        name = "Country") +
  theme_minimal()

8.1 Direct labels

Sometimes a legend isn’t necessary or the right aesthetic choice. Instead, it’s possible to append labels directly to dots or lines using directlabels or ggrepel.

Using directlabels and geom_dl. Note that other method options include first.points and last.qp (which adjusts the size of the text automatically). This also requires clipping to be turned off and the plot margins to be extended (otherwise text won’t be displayed properly).

df %>%
  ggplot(aes(x = time_period, y = total_exam_entries)) +
  geom_line(aes(colour = subject), linewidth = 0.8) +
  ylab("Entries") + 
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=270, vjust = 0.5),
        legend.position = "none",
        axis.line = element_line(colour="black"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(labels = scales::label_comma(big.mark = ","),
                     limits = c(0,80000)) +
  scale_x_continuous(breaks = c(200910, 201011, 201112, 201213, 201314, 201415, 201516, 201617, 
                                201718, 201819, 201920, 202021, 202122, 202223),
                     labels = c("2009/10", "2010/11", "2011/12", "2012/13", "2013/14", "2014/15",
                                "2015/16", "2016/17", "2017/18", "2018/19", "2019/20", "2020/21",  
                                "2021/22", "2022/23")) +
  geom_dl(aes(label=subject), method=list("last.points", "bumpup", cex=0.8)) +
  coord_cartesian(clip="off") +
  theme(plot.margin = unit(c(1,4,1,1), "lines")) 

Using geom_label_repel. Note that this requires a bit more wrangling to make sure only the final points are displayed (using the data argument) - otherwise every single point will be labelled.

df %>%
  ggplot(aes(x = time_period, y = total_exam_entries, label = subject)) +
  geom_line(aes(colour = subject), linewidth = 0.8) +
  ylab("Entries") + 
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=270, vjust = 0.5),
        legend.position = "none",
        axis.line = element_line(colour="black"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(labels = scales::label_comma(big.mark = ","),
                     limits = c(0,80000)) +
  scale_x_continuous(breaks = c(200910, 201011, 201112, 201213, 201314, 201415, 201516, 201617, 
                                201718, 201819, 201920, 202021, 202122, 202223),
                     labels = c("2009/10", "2010/11", "2011/12", "2012/13", "2013/14", "2014/15",
                                "2015/16", "2016/17", "2017/18", "2018/19", "2019/20", "2020/21",  
                                "2021/22", "2022/23")) +
  coord_cartesian(clip = "off") +
  geom_label_repel(aes(label = subject), 
                   label.padding = .15, 
                   data = df %>% group_by(subject) %>% filter(time_period == max(time_period)),
                   size = 3, hjust = 0.5, nudge_x=0.5)

9 Other things

9.1 Flip coordinates

ggplot2::diamonds %>%
  ggplot(aes(x=cut, y=carat)) +
  geom_violin() +
  coord_flip()

10 Using colours

10.1 Manual

Quick manual palettes for sequential and categorical variables (adjust to number needed).

seq2 = c("#12436D", "#6BACE6")
seq3 = c("#12436D", "#2073BC", "#6BACE6")

cat6 = c("#12436D", "#28A197", "#801650", "#F46A25", "#3D3D3D", "#A285D1")

# Sequential palettes
blues = c("#104F75", "#407291", "#7095AC", "#9FB9C8", "#CFDCE3")
reds = c("#8A2529", "#A15154", "#B97C7F", "#D0A8A9", "#E8D3D4")
oranges = c("#E87D1E", "#ED974B", "#F1B178", "#F6CBA5", "#FAE5D2")
yellows = c("#C2A204", "#CEB536", "#DAC768", "#E7DA87", "#F3ECCD")
greens = c("#004712", "#336C41", "#669171", "#99B5A0", "#CFDABD")
purples = c("#260859", "#51397A", "#7D6B9B", "#A89CBD", "#D4CEDE")

scales::show_col(c(blues, reds, oranges, yellows, greens, purples), ncol=5, cex_label=0.7)

scale_*_manual for manually-created palettes

# example of sequential. Tip: use rev() to reverse the order of colours
ggplot2::diamonds %>%
  filter(price < 6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) +
  scale_fill_manual(values = rev(yellows))

An example of how to specify manual linetype and colour. Note that these must use the same breaks in order to be presented as a single legend (as opposed to separate legends for linetype and for colour):

scale_linetype_manual(name = "",
                      values = c("solid", "dashed"),
                      breaks = c("pay", "median_after_tax"),
                      labels = c("Pay", "Median UK")) +
scale_colour_manual(name = "",
                    values = c("black", "red"),
                    breaks = c("pay", "median_after_tax"),
                    labels = c("Pay", "Median UK"))
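
The fragment above needs a plot to attach to. Below is a minimal, self-contained sketch: the data frame pay_df and its values are invented purely for illustration, and only the series names match the scales above.

```r
library(ggplot2)

# toy data: the values are made up for illustration only
pay_df <- data.frame(
  year   = rep(2020:2023, times = 2),
  series = rep(c("pay", "median_after_tax"), each = 4),
  value  = c(30, 31, 32, 33, 28, 28.5, 29, 30)
)

p_legend <- ggplot(pay_df, aes(x = year, y = value,
                               colour = series, linetype = series)) +
  geom_line(linewidth = 0.8) +
  # same name, breaks, and labels in both scales, so the legends merge into one
  scale_linetype_manual(name = "",
                        values = c("solid", "dashed"),
                        breaks = c("pay", "median_after_tax"),
                        labels = c("Pay", "Median UK")) +
  scale_colour_manual(name = "",
                      values = c("black", "red"),
                      breaks = c("pay", "median_after_tax"),
                      labels = c("Pay", "Median UK"))
```

If the name, breaks, or labels differ between the two scales, ggplot2 draws two separate legends instead.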

Use a focus palette to bring attention to one particular series, while keeping the other data there for comparison.

focus = c("#12436D", "#BFBFBF", "#BFBFBF") # contrast ratio of 5.57:1 (passes Web Content Accessibility
# Guidelines [WCAG]).
# Add as many grey colours as you need.

gcookbook::countries %>% 
  filter(str_detect(Name, "United")) %>%
  ggplot(aes(x=Year, y=GDP)) + 
  geom_line(aes(colour = Name), linewidth = 0.8) +
  scale_colour_manual(values = focus,
                      breaks = c("United Kingdom", "United Arab Emirates", "United States"),
                      labels = c("UK", "UAE", "US"),
                      name = "Country")

Alternatively, use gghighlight:

gcookbook::countries %>% 
  filter(str_detect(Name, "Republic")) %>%
  ggplot(aes(x=Year, y=GDP)) + 
  geom_line(aes(colour = Name), linewidth = 0.8) +
  gghighlight(str_detect(Name, "Czech"),
              label_params = list(size = 3),
              label_key = Code)

10.2 Other colour scale options

# if a fill or colour variable is continuous, the default scale will be scale_*_continuous,
# so there's no need to specify it directly:
p = gcookbook::heightweight %>%
  ggplot(aes(x = heightIn, y = weightLb, colour = heightIn)) +
  geom_point(size = 2)

p +
  #scale_colour_continuous() +
  theme_void() +
  ggtitle("scale_colour_continuous")

p +
  scale_colour_gradient(low = "white", high = "red", na.value = "black") +
  theme_void() +
  ggtitle("scale_colour_gradient")

# scale_colour_gradient2 allows you to set a mid-point colour but you have to specify this from
# the dataframe. Of course you don't have to use the median as the 'midpoint' here. 
p +
  scale_colour_gradient2(low = "blue", mid = "white", high = "red",
  midpoint = median(gcookbook::heightweight$heightIn)) +
  theme_void()

# scale_colour_gradientn allows you to specify your own set of colours for a whole spectrum.
# note: if you don't specify a vector of values, the colours will be evenly positioned along
# the scale. However, you can control this manually for instance if you want to highlight the very 
# top values. Just specify a vector of values between 0 and 1 to correspond to how you want the 
# colours mapped to the scale, e.g. values = c(1, 0.9, 0.8, 0.7, 0)
p +
  scale_colour_gradientn(colours = c("red","yellow","green","lightblue","darkblue"),
                                    values = c(1, 0.9, 0.8, 0.7, 0)) +
  theme_void() +
  ggtitle("scale_colour_gradientn")

10.3 Brewer palettes

Use scale_*_brewer for predefined palettes. To see all of the available palettes, run RColorBrewer::display.brewer.all().

10.3.1 Brewer monochrome palettes

  • Blues
  • Greens
  • Purples
  • Greys
  • Oranges
  • Reds
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Greens") +
  theme_void()

10.3.2 Brewer spectral palettes

  • BuGn
  • BuPu
  • GnBu
  • OrRd
  • PuBu
  • PuRd
  • RdPu
  • YlGn
  • PuBuGn
  • YlGnBu
  • YlOrBr
  • YlOrRd
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "YlOrRd") +
  theme_void()

10.3.3 Brewer diverging palettes

  • Spectral
  • RdYlGn
  • RdYlBu
  • RdGy
  • RdBu
  • PuOr
  • PRGn
  • PiYG
  • BrBG
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Spectral") +
  theme_void()

10.3.4 Brewer qualitative palettes

  • Accent
  • Dark2
  • Paired
  • Pastel1 and Pastel2
  • Set1
  • Set2
  • Set3
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Set2") +
  theme_void()

Example of a categorical palette from the Analysis Function.

# note: recommendation is to use a max of 4 colours
categorical = c("#12436D", "#28A197", "#801650", "#F46A25", "#3D3D3D", "#A285D1")
scales::show_col(categorical)

How to use a custom palette:

ggplot2::midwest %>%
  filter(county %in% c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN")) %>%
  ggplot(aes(x = county, y = poptotal, fill = county)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = categorical)

Using scale_fill_brewer palettes:

ggplot2::midwest %>%
  filter(county %in% c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN")) %>%
  ggplot(aes(x=county, y=poptotal, fill = county)) +
  geom_col(position = position_dodge()) +
  scale_fill_brewer(type = "qual", palette = "Set1")

11 Mapping

Use the sf package and a shapefile to visualise data on a map. This shapefile comes from the UK government’s Geoportal. This tool is also a good resource for finding different shapefiles.

library(sf)
library(tidyverse)

# firstly read in a shapefile. Note that when you download a shapefile, it comes with other 
# files like .shx. These all need to be in the same directory as the .shp file itself.
# As such, you can actually just specify the folder name that contains the shapefile and its
# accompanying files. 

eng_reg_map = st_read("./Shapefiles/Eng_regional_2023/RGN_DEC_2023_EN_BFC.shp")

Reading layer `RGN_DEC_2023_EN_BFC' from data source 
  `/Users/chrisdixon/G Drive/R/R script files/How to do stuff in R/Shapefiles/Eng_regional_2023/RGN_DEC_2023_EN_BFC.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 9 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 82668.52 ymin: 5336.966 xmax: 655653.8 ymax: 657536
Projected CRS: OSGB36 / British National Grid

# check that the map is appropriate for your needs by graphing it without any data attached
eng_reg_map %>%
  ggplot() +
  geom_sf(fill   = "white",
          colour = "black") +
  theme_void()

# next, add some data from a preexisting source:
df = data.frame(
  region = c("North West", "North East", "Yorkshire and The Humber", 
             "East of England", "West Midlands", "East Midlands", 
             "South East", "London", "South West"),
  n = c(81, 25, 39, 49, 51, 43, 53, 115, 32)
)

# join this to the sf object and map. 
# use scale_fill_gradient to control fill colours

eng_reg_map %>%
  left_join(df, by = c("RGN23NM" = "region")) %>%
  ggplot(aes(fill = n)) +
  geom_sf(colour = "grey50") +
  theme_void() +
  scale_fill_gradient(name = "Respondents",
                      low = "grey90", high = "grey15", 
                      na.value = "white")

To add points based on geometry:

# say you have a data frame with easting and northing values (these could be lat/long too)
map_df = data.frame(
  school = c(1:15),
  easting = c(427144, 390498, 391459, 453596, 332195, 439952, 397143, 331548, 504305, 527647, 349039, 437349, 504740, 448965, 166988),
  northing = c(433927, 323814, 162039,  86591, 391270, 114064, 157408, 390872, 208312, 169328, 405788, 377158, 103181, 361343,  44078),
  pupils  = c(28, 23, 59, 22, 28, 76, 13, 14, 57, 43, 25, 12, 59, 52, 31)
)

# crucial step - check which CRS the original map uses
st_crs(eng_reg_map)$epsg

[1] 27700

# make sure you use the same CRS for the dataframe when you convert it to sf:
schools.sf = st_as_sf(map_df, coords = c("easting", "northing"), remove = FALSE, crs = 27700)

# check that the points themselves work
# plot(schools.sf$geometry)

# add to the map
eng_reg_map %>%
  ggplot() +
  geom_sf(fill   = NA,
          colour = "grey50") +
  geom_sf(data = schools.sf, aes(color = pupils), size = 2) +
  theme_void() +
  scale_colour_gradient(low = "grey", high = "red")

12 Odd bits for sorting later

12.1 Vertical and horizontal lines

mpg %>% 
  ggplot(aes(x=displ, y = cty)) + 
  geom_point() + 
  geom_vline(xintercept = 4.5, colour = "red", linetype="dashed") +
  geom_hline(yintercept = 22.5, colour = "purple", linetype = "twodash")

13 Combining and saving plots

13.1 Facetting

To facet by a single variable, use facet_wrap. Add scales = "free" if axes are different for each group. NOTE: the old way was ~ species, but vars allows you to specify several groupings, e.g. vars(species, sex).

palmerpenguins::penguins %>% 
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(colour = island)) +
  facet_wrap(vars(species), 
             nrow = 3,
             scales = "fixed") +
  theme_bw()

To facet by two variables, use facet_grid

palmerpenguins::penguins %>% 
  drop_na(sex) %>%
  ggplot(aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point() +
  facet_grid(species ~ sex) +
  theme_bw()

13.2 Arranging

Use ggarrange from the ggpubr package to arrange plots.

# plot 1
a = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line() +
    scale_y_continuous(labels = label_comma(big.mark = ",")) +
    theme_bw()

# plot 2
b = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = healthexp)) + 
    geom_line() +
    scale_y_continuous(labels = label_comma(big.mark = ",")) +
    theme_bw()

# combine
ggarrange(a, b, 
          ncol=2, 
          widths = c(1,1),
          labels = c("A", "B"))

You can also nest ggarrange calls, e.g. if you have 3 or more plots

# named c_plot rather than c to avoid masking base::c()
c_plot = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = infmortality)) + 
    geom_line() +
    theme_bw()

ggarrange(a,
          ggarrange(b, c_plot, ncol = 2),
          nrow = 2)

It’s also possible to use a shared legend and title (if applicable to all plots)

a = gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line(aes(colour = Name, linetype = Name), linewidth = 0.8) +
    theme_bw()

# plot 2
b = gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = healthexp)) + 
    geom_line(aes(colour = Name, linetype = Name), linewidth = 0.8) +
    theme_bw()

# wrap in annotate_figure for common title
annotate_figure(ggarrange(a, b, ncol=2,
          common.legend = T,
          legend = "bottom"),
    top = text_grob("A common title", color = "red", face = "bold", size = 14),
    bottom = text_grob("Data source: \n Countries data set", color = "blue",
                                  hjust = 1, x = 0.99, face = "italic", size = 10),
    fig.lab.pos = "top",
    fig.lab.size = 14)

13.3 Saving

my_plot = ggplot(...)

ppi = 300  # pixels per inch
png("file_name.png", width = 4*ppi, height = 4*ppi, res = ppi)
print(my_plot)  # print() explicitly so this also works inside functions, loops, and sourced scripts
dev.off()
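
ggplot2 also provides ggsave(), which opens and closes the graphics device for you. A minimal sketch, writing to a temporary directory (the file name and the plot are illustrative):

```r
library(ggplot2)

my_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# write a 4 x 4 inch PNG at 300 dpi; tempdir() is used here only for the example
out_file <- file.path(tempdir(), "my_plot.png")
ggsave(out_file, plot = my_plot,
       width = 4, height = 4, units = "in", dpi = 300)
```

The file type is taken from the extension, so changing ".png" to ".pdf" or ".svg" switches format (svg needs the svglite package).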

13.4 Looping or iterating

Sometimes you want to produce the same graph for a range of different variables. The recipe below produces histograms, grouped by ‘EAL’, for every variable listed in the covariates vector. It uses purrr::map to iterate through this vector and then invokes cowplot::plot_grid to add these all to one file.

# note: this function uses a data frame called `data` from the global environment
hist_group_fun = function(x, y = NA) {
  ggplot(data, aes(x = .data[[x]], fill = .data[[y]], colour = .data[[y]]) ) +
    geom_histogram(alpha=0.5, position = "identity") +
    theme(legend.text = element_text(size=9))
}

hist_group_fun(x = "IDACI.rank19", y = "EAL")

# make a list of variables to loop through
covariates = c("cpm.raw", "bpvs.raw", "IDACI.rank19")

# iterate through covariates
covariate_plots = map(covariates, ~ hist_group_fun(.x, "EAL") )

# save to a single pdf file
pdf("histograms_all.pdf")
print(cowplot::plot_grid(plotlist = covariate_plots))
dev.off()

This slightly different recipe takes an outcome variable and grouping factor to produce a histogram. It uses enquo to capture the grouping factor, counts its distinct values with n_distinct, and supplies that number to the nrow argument of facet_wrap.

dist_graph = function(data, x, char){
  
  # quote the 'char' variable
  char_var = enquo(char)
  
  # find unique number of values in 'char' by unquoting
  width = data %>% 
    summarise(dist = n_distinct(!!char_var)) %>% as.numeric()
  
  data %>%
    ggplot(aes(x = {{x}}, fill = {{char}})) +
    geom_histogram(binwidth = 1, position  = "identity") +
    facet_wrap(vars({{char}}), 
               scales = "free",
               nrow = width) +
    theme(legend.position = "none",
          axis.title.y = element_blank()) +
    geom_vline(xintercept = 32, colour = "red")
}

dist_graph(c1_c2, x = PHONICSMARK_y1, char = MoB)